speech response
SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement
Ge, Yuan, Zhang, Junxiang, Liu, Xiaoqian, Li, Bei, Ma, Xiangnan, Wang, Chenglong, Ye, Kaiyang, Du, Yangfan, Zhang, Linfeng, Huang, Yuxin, Xiao, Tong, Yu, Zhengtao, Zhu, JingBo
Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
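The headline metric here is agreement rate with human evaluators. A minimal sketch of how such a rate can be computed, assuming each test item carries one human preference label and one judge decision over the same pair of spoken responses (the function and label names are illustrative, not from the SageLM paper):

```python
from typing import Sequence

def agreement_rate(judge_choices: Sequence[str], human_choices: Sequence[str]) -> float:
    """Fraction of items where the LLM judge picks the same winner as the human annotator."""
    assert len(judge_choices) == len(human_choices)
    matches = sum(j == h for j, h in zip(judge_choices, human_choices))
    return matches / len(human_choices)

# Toy usage: three judged pairs, the judge agrees with the human on two of them.
print(agreement_rate(["A", "B", "A"], ["A", "B", "B"]))  # 0.666...
```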
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (5 more...)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.93)
Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models
Wang, Haoyu, Zhang, Guangyan, Chen, Jiale, Li, Jingyu, Wang, Yuehai, Guo, Yiwen
With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models only convert response content into speech without fully capturing the rich emotional cues in user queries, where the same sentence may convey different meanings depending on how it is expressed. Emotional understanding is thus essential for improving human-machine interaction. Most empathetic speech LLMs rely on massive datasets, which demands high computational cost. A key challenge is to build models that generate empathetic responses with limited data and without large-scale training. To this end, we propose Emotion Omni, a model that understands emotional content in user speech and generates empathetic responses. We further develop a data pipeline to construct a 200k emotional dialogue dataset that supports empathetic speech assistants. Experiments show that Emotion Omni achieves comparable instruction-following ability without large-scale pretraining, while surpassing existing models in speech quality (UTMOS: 4.41) and empathy (Emotion GPT Score: 3.97). These results confirm its improvements in both speech fidelity and emotional expressiveness. Demos are available at https://w311411.github.io/omni_demo/.
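One step of such a data pipeline could look like the sketch below: label the emotion of a user utterance, prompt a text LLM for an empathetic reply, and synthesize expressive speech for it. The `detect_emotion`, `generate_reply`, and `synthesize` callables are hypothetical stand-ins, not the authors' actual pipeline components.

```python
def build_example(user_text: str, detect_emotion, generate_reply, synthesize) -> dict:
    """Construct one empathetic-dialogue training example from a user utterance."""
    emotion = detect_emotion(user_text)                    # e.g. "sad", "excited"
    prompt = (f"The user sounds {emotion}. Reply empathetically and concisely.\n"
              f"User: {user_text}\nAssistant:")
    reply_text = generate_reply(prompt)                    # text-LLM call
    reply_audio = synthesize(reply_text, style=emotion)    # expressive TTS
    return {"user": user_text, "emotion": emotion,
            "assistant_text": reply_text, "assistant_audio": reply_audio}
```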
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models
Chiang, Cheng-Han, Wang, Xiaofei, Li, Linjie, Lin, Chung-Ching, Lin, Kevin, Liu, Shujie, Wang, Zhendong, Yang, Zhengyuan, Lee, Hung-yi, Wang, Lijuan
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time needed to generate its tokens, we use the remaining free time to generate the unspoken reasoning tokens. While a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs as well as those baseline models on non-reasoning datasets. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
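The alternation can be pictured with a small scheduling sketch, assuming a chunk-level generator; `generate_chunk`, the chunk tags, and the end-of-sequence marker are illustrative placeholders, not the paper's actual interface.

```python
def respond(question, generate_chunk, max_chunks=8):
    """Alternate unspoken reasoning chunks and spoken chunks, as described above."""
    context, reasoning, spoken = [question], [], []
    for _ in range(max_chunks):
        r = generate_chunk(context, kind="reasoning")  # unspoken CoT chunk (never voiced)
        context.append(r); reasoning.append(r)
        s = generate_chunk(context, kind="spoken")     # chunk the user will actually hear
        context.append(s); spoken.append(s)
        # While the audio of `s` is playing, the next reasoning chunk is generated
        # "for free", so only the first spoken chunk adds latency.
        if s.endswith("<eos>"):
            break
    return reasoning, spoken
```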
- Asia (0.92)
- North America > United States (0.67)
- Information Technology > Artificial Intelligence > Speech (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
Zhang, Haonan, Luo, Run, Liu, Xiong, Wu, Yuchuan, Lin, Ting-En, Zeng, Pengpeng, Qu, Qiang, Fang, Feiteng, Yang, Min, Gao, Lianli, Song, Jingkuan, Huang, Fei, Li, Yongbin
Role-Playing Agents (RPAs), benefiting from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role's voice traits (e.g., voice style and emotions), which play a crucial role in interaction and make the experience far more immersive in realistic scenarios. To address this gap, we propose OmniCharacter, the first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality and vocal traits throughout the interaction, producing a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which comprises 20 distinctive characters, 10K richly contextualized multi-round dialogues, and 135K dynamic speech responses. Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
Fang, Qingkai, Zhou, Yan, Guo, Shoutao, Zhang, Shaolei, Feng, Yang
Real-time, intelligent, and natural speech interaction is an essential part of next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.
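The described layout (a speech encoder feeding an LLM backbone whose hidden states drive an autoregressive speech decoder) can be sketched as a minimal PyTorch skeleton; the module choices, dimensions, and names below are illustrative stand-ins, not LLaMA-Omni 2's actual implementation.

```python
import torch
import torch.nn as nn

class SpeechLMSkeleton(nn.Module):
    """Speech encoder -> LLM backbone -> streaming speech-unit decoder (illustrative)."""
    def __init__(self, d_model=1024, n_speech_units=4096):
        super().__init__()
        self.speech_encoder = nn.GRU(80, d_model, batch_first=True)            # stand-in for a pretrained encoder
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=4)                  # stand-in for the LLM backbone
        self.speech_decoder = nn.Linear(d_model, n_speech_units)               # predicts discrete speech units

    def forward(self, mel):                    # mel: (batch, frames, 80) filterbank features
        enc, _ = self.speech_encoder(mel)      # continuous speech representations
        hidden = self.llm(enc)                 # contextualized hidden states
        return self.speech_decoder(hidden)     # unit logits, vocoded to audio chunk by chunk

# Usage: logits = SpeechLMSkeleton()(torch.randn(1, 200, 80))
```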
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
Gao, Heting, Shao, Hang, Wang, Xiong, Qiu, Chaofan, Shen, Yunhang, Cai, Siqi, Shi, Yuchen, Xu, Zihan, Long, Zuwei, Zhang, Yike, Dong, Shaoqi, Fu, Chaoyou, Li, Ke, Ma, Long, Sun, Xing
The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Moving one step toward such a sophisticated audio agent, building on recent advances in end-to-end (E2E) speech systems, we propose LUCY, an E2E speech model that (1) senses and responds to the user's emotion, (2) delivers responses in a succinct and natural style, and (3) uses external tools to answer real-time inquiries. Experimental results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. LUCY is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are outside its knowledge scope.
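The function-calling behavior for out-of-knowledge questions can be sketched as a simple dispatch loop, assuming the model signals a tool call with a small JSON object; the call format and the `tools` registry are illustrative assumptions, not LUCY's actual interface.

```python
import json

def answer(query, model_generate, tools):
    """Answer directly, or route real-time questions through an external tool."""
    out = model_generate(query)
    if out.strip().startswith("{"):                    # e.g. {"tool": "web_search", "args": {"q": "..."}}
        call = json.loads(out)
        result = tools[call["tool"]](**call["args"])   # run the requested tool
        return model_generate(f"{query}\nTool result: {result}\nAnswer:")
    return out                                         # question was within the model's knowledge
```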
- Leisure & Entertainment (0.48)
- Health & Medicine > Therapeutic Area (0.34)
- Media (0.34)
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities
Zhang, Xin, Lyu, Xiang, Du, Zhihao, Chen, Qian, Zhang, Dong, Hu, Hangrui, Tan, Chaohong, Zhao, Tianyu, Wang, Yuxuan, Zhang, Bin, Lu, Heng, Zhou, Yaqian, Qiu, Xipeng
Current methods of building LLMs with voice interaction capabilities rely heavily on explicit text autoregressive generation before or during speech response generation to maintain content quality, which unfortunately brings computational overhead and increases latency in multi-turn interactions. To address this, we introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities. IntrinsicVoice aims to facilitate the transfer of textual capabilities of pre-trained LLMs to the speech modality by mitigating the modality gap between text and speech. Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences while generating high-quality audio, significantly reducing the length difference between speech and text, speeding up inference, and alleviating long-text modeling issues. Additionally, we construct a multi-turn speech-to-speech dialogue dataset named IntrinsicVoice-500k, which includes nearly 500k turns of speech-to-speech dialogues, and a cross-modality training strategy to enhance the semantic alignment between speech and text. Experimental results demonstrate that IntrinsicVoice can generate high-quality speech responses with latency lower than 100ms in multi-turn dialogue scenarios. Demos are available at https://instrinsicvoice.github.io/.

Large language models (LLMs) (Yang et al., 2024; Dubey et al., 2024; OpenAI, 2023) and multimodal large language models (MLLMs) (Tang et al., 2023; Chu et al., 2024; Liu et al., 2024) have exhibited exceptional performance across a variety of natural language processing tasks and multimodal comprehension tasks, allowing them to become powerful solvers for general tasks.
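The length-reduction idea behind this (packing several consecutive speech tokens into one position so the LLM sees a sequence comparable in length to text) can be sketched as below; the fixed group size and mean pooling are illustrative simplifications, and the paper's GroupFormer uses its own learned grouping module instead.

```python
import torch

def group_speech_tokens(speech_emb: torch.Tensor, group_size: int = 5) -> torch.Tensor:
    """Pack consecutive speech-token embeddings into groups: (B, T, D) -> (B, ceil(T/G), D)."""
    b, t, d = speech_emb.shape
    pad = (-t) % group_size                               # pad so T is divisible by the group size
    if pad:
        speech_emb = torch.cat([speech_emb, speech_emb.new_zeros(b, pad, d)], dim=1)
    return speech_emb.view(b, -1, group_size, d).mean(dim=2)   # one pooled vector per group

# Usage: a 5x shorter sequence for the LLM backbone.
# group_speech_tokens(torch.randn(1, 500, 1024)).shape  ->  torch.Size([1, 100, 1024])
```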
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Fang, Qingkai, Guo, Shoutao, Zhou, Yan, Ma, Zhengrui, Zhang, Shaolei, Feng, Yang
Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.

Large language models (LLMs), represented by ChatGPT (OpenAI, 2022), have become powerful general-purpose task solvers, capable of assisting people in daily life through conversational interactions. However, most LLMs currently only support text-based interactions, which limits their application in scenarios where text input and output are not ideal. Recently, the emergence of GPT-4o (OpenAI, 2024) has made it possible to interact with LLMs through speech, responding to users' instructions with extremely low latency and significantly enhancing the user experience.
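Low first-packet latency of this kind is typically achieved with streaming synthesis: audio is vocoded and played chunk by chunk as speech units arrive, so the first sound reaches the user after only the first chunk rather than after the full response. A hedged sketch of that general pattern follows; `unit_generator`, `vocoder`, and `play` are hypothetical stand-ins, not LLaMA-Omni's actual components.

```python
def stream_response(unit_generator, vocoder, play, chunk=40):
    """Vocode and play discrete speech units in small chunks as they are produced."""
    buf = []
    for unit in unit_generator:      # speech units emitted alongside the text response
        buf.append(unit)
        if len(buf) == chunk:        # a fraction of a second of audio, depending on unit rate
            play(vocoder(buf))       # synthesize and play this chunk immediately
            buf = []
    if buf:                          # flush whatever is left at the end
        play(vocoder(buf))
```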
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
How IBM Is Employing AI To Predict Alzheimer's Disease
IBM researchers then used NLP to analyse the participants' language sample transcripts. The model picked up tiny subtleties and changes in discourse that are generally missed when the analysis is done manually. Based on this, IBM researchers trained the ML model to account for multiple variables affecting the results. Lastly, they drew on data from the subjects at the Framingham Heart Study, where participants are assessed through two-minute Mini-Mental State Examination speech tests every four years and neuropsychological exams every year.

Figure: CTT examples from FHS, including an unimpaired sample (a), an impaired sample showing telegraphic speech and lack of punctuation (b), and an even more impaired sample additionally showing significant misspellings and minimal grammatical complexity.
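The general recipe described here (extract linguistic markers such as telegraphic phrasing, missing punctuation, and reduced vocabulary from transcripts, then fit a model that predicts later impairment) can be sketched roughly as follows. This is a simplified illustration with hand-picked features, not IBM's actual model.

```python
from sklearn.linear_model import LogisticRegression

def transcript_features(text: str) -> list[float]:
    """A few simple linguistic markers of the kind mentioned above."""
    words = text.split()
    sentences = [s for s in text.split(".") if s.strip()]
    vocab_richness = len(set(words)) / max(len(words), 1)       # repetitive wording lowers this
    avg_sentence_len = len(words) / max(len(sentences), 1)      # telegraphic speech shortens this
    punctuation_rate = sum(c in ".,;:" for c in text) / max(len(words), 1)
    return [vocab_richness, avg_sentence_len, punctuation_rate]

def fit_classifier(transcripts: list[str], impaired_labels: list[int]) -> LogisticRegression:
    """Fit a classifier on transcripts collected years before diagnosis."""
    X = [transcript_features(t) for t in transcripts]
    return LogisticRegression().fit(X, impaired_labels)
```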
- North America > Canada > Quebec > Montreal (0.06)
- Asia > Japan > Honshū > Kantō > Ibaraki Prefecture > Tsukuba (0.06)
Daily chats with AI could help spot early signs of Alzheimer's
But the earlier it's diagnosed, the more chances there are to delay its progression. Our joint team of researchers from IBM and the University of Tsukuba has developed an AI model that could help detect the onset of mild cognitive impairment (MCI), the transitional stage between normal aging and dementia, by asking older people typical daily questions. In a new paper published in the journal Frontiers in Digital Health, we present the first empirical evidence of tablet-based automatic assessments of patients using speech analysis, successfully detecting MCI. Unlike previous studies, our AI-based model uses speech responses to daily life questions collected with a smartphone or tablet app. Such questions could be as simple as asking someone about their mood, plans for the day, physical condition or yesterday's dinner. Earlier studies mostly focused on analyzing speech responses during cognitive tests, such as asking a patient to "count down from 925 by threes" or "describe this picture in as much detail as possible."
- Health & Medicine > Therapeutic Area > Neurology > Alzheimer's Disease (0.59)
- Health & Medicine > Therapeutic Area > Neurology > Dementia (0.49)